perm filename VIS[0,BGB]14 blob
sn#113996 filedate 1974-08-06 generic text, type C, neo UTF8
{⊂C;<N;αVISION THEORY.;λ30;P60;I425,0;JCFA} SECTION 6.
{JCFD} COMPUTER VISION THEORY.
{λ10;W250;JAFA}
6.0 Introduction to Computer Vision Theory.
6.1 A Geometric Feedback Vision System.
6.2 Vision Tasks.
6.3 Vision System Design Arguments.
6.4 Mobile Robot Vision.
6.5 Summary and Related Vision Work.
{λ30;W0;I900,0;JUFA}
⊂6.0 Introduction to Computer Vision Theory.⊃
Computer vision concerns programming a computer to do a task
that demands the use of an image forming light sensor such as a
television camera. The theory I intend to elaborate is that general
3-D vision is a continuous process of keeping an internal visual
simulator in sync with perceived images of the external reality so
that vision tasks can be done more by reference to the simulator's
model and less by reference to the original images. The word
<theory>, as used here, means simply a set of statements presenting
a systematic view of a subject; specifically, I wish to exclude the
connotations that the theory is a natural theory of vision. Perhaps
there can be such a thing as an <artificial theory> which extends
from the philosophy through the design of an artifact.
⊂6.1 A Geometric Feedback Vision System.⊃
Vision systems mediate between images and world models; these
two extremes of a vision system are called, in the jargon, the
<bottom> and the <top> respectively. In what follows, the word
<image> will be used to refer to the notion of a 2-D data structure
representing a picture; a picture being a rectangle taken from the
pattern of light formed by a thin lens on the nearly flat
photoelectric surface of a television camera's vidicon. On the other
hand, a <world model> is a data structure which is supposed to
represent the physical world for the purposes of a task processor. In
particular, the main point of this thesis concerns isolating a
portion of the world model (called the 3-D geometric world model) and
placing it below most of the other entities that a task processor has
to deal with. The vision hierarchy, so formed, is illustrated in box 6.1.
{|λ10;JA}
BOX 6.1 {JC} VISION SYSTEM HIERARCHY.
{JC} Task Processor
{JC} |
{JC} Task World Model
The Top → {JC} |
{JC} 3-D Geometric Model
{JC} |
The Bottom → {JC} 2-D Images
{|λ30;JU}
Between the top and the bottom, between images and the task
world model, a general vision system has three distinguishable modes
of operation: recognition, verification and description.
Recognition vision can be characterized as bottom up. What is in the
picture is determined by extracting a set of features from the image
and by classifying them with respect to prejudices which must be taught.
Verification vision, also called top down or
model driven vision, involves predicting an image followed by
comparing the predicted image and a perceived image for differences
which are expected but not yet measured. Descriptive vision is bottom up
or data driven vision and involves converting the image into a
representation that makes it possible (or easier) to do the desired
vision task. I would like to call this kind of vision "revelation
vision" at times, although the phrase "descriptive vision" is the
term used by most members of the computer vision community.
{|λ10;JU;FA}
Box 6.2 {JC} THREE BASIC MODES OF VISION.
1. Recognition Vision - Feature Classification. (bottom up into a prejudiced top).
2. Verification Vision - Model Driven Vision. (nearly pure top down vision).
3. Descriptive Vision - Data Driven Vision. (nearly pure bottom up vision).
{|λ30;JU}
Now we have enough pieces to outline a system design.
By placing a 3-D geometric model in the gap, recognition vision can
be done on 3-D (rather than 2-D) features into the task world model,
and descriptive vision and verification vision can be used to link
the 2-D and 3-D models in a relatively dumb, mechanical fashion.
Previous attempts to use recognition vision, to bridge directly the
large gap between 2-D images (of 3-D objects) and the task world
model, have been frustrated because the characteristic 2-D image
features of a 3-D object are very dependent on the 3-D physical
processes of occultation, rotation and illumination. It is these
processes that will have to be modeled and understood before the
features relevant to the task processor can be deduced from the
perceived images. The arrangement of these elements is diagramed
below.
{|λ10;JA}
Box 6.3 {JC} BASIC FEEDBACK VISION SYSTEM DESIGN.
{JC} Task World Model
{JC} ↑
{JC} RECOGNITION
{JC} ↑
{JC} 3-D geometric model
{JC} ↑ ↓
{JC} DESCRIPTION VERIFICATION
{JC} ↑ ↓
{JC} 2-D images
{|λ30;JU}
I wish to call attention to the lower part of the above diagram;
this portion is the feedback loop of the 3-D geometric vision system.
Depending on circumstances, the vision system should be able to run
almost entirely top-down (verification vision) or bottom-up
(revelation vision). Verification vision is all that is required in a
well-known, predictable environment; whereas revelation vision is
required in a brand new (tabula rasa) or rapidly changing
environment. Thus revelation and verification form a loop,
bottom-up and top-down: first, revelation builds
a 3-D model without prejudice; second, the model is verified by
testing image features predicted from the assumed model. This loop-like
structure has been noted before by others; it is a form of what
Tenenbaum(71) called <accommodation> and a form of what Falk(69)
called <heuristic vision>; however, I will go along with what I think
is the current majority of vision workers who call it <feedback vision>.
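In a modern notation, the revelation/verification loop can be sketched in a few lines of code. This is only an illustrative skeleton, not the system built in this thesis: sets of symbolic features stand in for images, and all the names are invented.

```python
# Sketch of the feedback loop: revelation adds unmodeled features
# bottom-up; verification keeps only model features the image confirms.
# Feature sets stand in for images; all names here are illustrative.

def revelation(perceived, model):
    """Bottom-up: enter every perceived feature into the 3-D model."""
    return model | perceived

def verification(perceived, model):
    """Top-down: confirm modeled features against the perceived image."""
    return model & perceived

def feedback_cycle(image_sequence):
    """Vision as a continuous process: run the loop frame by frame."""
    model, confirmed = set(), set()
    for perceived in image_sequence:
        model = revelation(perceived, model)
        confirmed = verification(perceived, model)
    return model, confirmed
```

In a well-known environment the verification half alone suffices; in a tabula rasa environment the revelation half does nearly all the work.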
Completing the design, the images and worlds are
constructed, manipulated and compared by a variety of processors,
the topmost of which is the task processor. Since the task processor is
expected to vary with the application, it would be expedient if it
could be isolated as a user program that calls on utility routines
of an appropriate vision sub-system. Immediately below the task
processor are the 3-D recognition routines and the 3-D modeling
routines. The modeling routines underlie almost everything because
they are used to create, alter and access the models.{
|;λ10;JAFA}
Box 6.4 {JC} PROCESSORS OF A 3-D VISION SYSTEM.
{↓}
0. The task processor.
1. 3-D recognition.
2. 3-D modeling routines.
3. Reality simulator.
{↑;W560;}
4. Image analyser.
5. Image synthesizer.
6. Locus solvers.
7. Comparators: 2D and 3D.
{|;λ30;JUFA}
The remaining processors include the reality simulator which
does mechanics for modeling motion, collision and gravity.
Also there are image analyzers, which do image enhancement and
conversions such as converting video rasters into line drawings.
There is an image synthesizer, which does hidden line and surface
elimination, for verification by comparing synthetic images from the
model with perceived images of reality. There are three kinds of
locus solvers that compute numerical descriptions for cameras, light
sources and physical objects. Finally, there are of course a large
number (at least ten) of different compare processors for confirming or
denying correspondences among entities in each of the different kinds
of images and 3-D models.
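As a toy instance of one such compare processor, the sketch below confirms or denies correspondences between predicted and perceived 2-D point features by nearest-neighbor matching within a tolerance. The representation and the tolerance are assumptions of the illustration, not the actual comparators of this system.

```python
import math

# Toy comparator: match predicted (x, y) features to perceived ones by
# nearest neighbor within a tolerance; unmatched predictions are denied.

def compare(predicted, perceived, tol=4.0):
    matches, denied = [], []
    unused = list(perceived)
    for p in predicted:
        q = min(unused, key=lambda u: math.dist(p, u), default=None)
        if q is not None and math.dist(p, q) <= tol:
            matches.append((p, q))      # correspondence confirmed
            unused.remove(q)
        else:
            denied.append(p)            # no perceptual support
    return matches, denied
```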
⊂6.2 Vision Tasks.⊃
The 3-D vision research problem being discussed is that of
finding out how to write programs that can see in the real world.
Related vision problems include: modeling human
perception, solving visual puzzles (non-real world), and developing
advanced automation techniques (ad hoc vision). In order to approach
the problem, specific programming tasks are proposed and solutions
are sought. However, please distinguish the idea of a research problem
from that of a programming task; as will be illustrated, many vision
tasks can be done without vision. The vision solution to be found
should be able to deal with real images, should include the
continuity of the visual process in time and space, and should be
more general purpose and less ad hoc. These three requirements
(reality, continuity, and generality) will be developed by surveying
six examples of computer vision tasks.
{|;λ10;JAFA}
BOX 6.5{JC} TABLE OF 3-D COMPUTER VISION TASKS.
{↓}
<Cart Related Tasks>.
1. The Chauffeur Task.
2. The Explorer Task.
3. The Soldier Task.
{↑;W650;}
<Table Top Related Tasks>.
4. Turn Table Task.
5. The Blocks Task.
6. Machine Assembly Tasks.
{|;λ30;JUFA}
First, there is the robot chauffeur task. In 1969, John
McCarthy asked me to consider the vision requirements of a computer
controlled car such as he depicted in an unpublished essay. The idea
is that a user of such an automatic car would request a destination;
the robot would select a route from an internally stored road map;
and it would then proceed to its destination using visual data. The
problem involves representing the road map in the computer and
establishing the correspondence between the map and the appearance of
the road as the automatic chauffeur drives the vehicle along the
selected route. Lacking a computer controlled car, the problem was
abstracted to that of tracing a route along the driveways and parking
lots that surround the Stanford A.I. Laboratory using a television
camera and transmitter mounted on a radio controlled electric cart.
The robot chauffeur task could be solved by non-visual means such as
by railroad-like guidance or by inertial guidance; to preserve the
vision aspect of the problem, no particular artifacts should be
required along a route (landmarks must be found, not placed); and the
extent of inertial dead reckoning should be noted.
Second, there is the task of a robot explorer. In 1967,
McCarthy and Lederberg published a description of a robot for
exploring the surface of the planet Mars (ref. **). The robot
explorer was required to run for long periods of time without human
intervention because the signal transmission time to Mars is as great
as twenty minutes and because the 24.6 hour Martian day would place
the vehicle out of Earth sight for twelve hours at a time. (This
latter difficulty could be avoided at the expense of having a set of
communication relay satellites in orbit around Mars). The task of the
explorer would be to drive around mapping the surface of Mars,
looking for interesting features, and doing various experiments. To
be prudent, a Mars explorer should be able to navigate without
vision; this can be done by driving slowly and by using a tactile
collision and crevasse detector. If the television system fails, the
core samples and so on can still be collected at different Martian
sites without unusual risk to the vehicle due to visual blindness.
The third vision task is that of the robot soldier, tank,
sentry, pilot or policeman. The problem has several forms which are
quite similar to the chauffeur and the explorer with the additional
goal of doing something to coerce an opponent. Although this vision
task has not yet been explicitly attempted at Stanford, to the best
of my knowledge, the reader should be warned that a thorough solution
to any of the other tasks almost assures the Orwellian technology to
solve this one.
Fourth, the turn table task is to construct a 3-D model from
a sequence of 2-D television images taken of an object rotated on a
turn table. The turntable task was selected as a simplification of
the explorer task and is an example of a nearly pure descriptive
vision task.
Fifth, the classic blocks vision task consists of two parts:
first convert a video image into a line drawing; second, make a
selection from a set of predefined prototype models of blocks that
accounts for the line drawing. In my opinion, this vision task
emphasizes three pitfalls: single image vision, line drawings and
blocks. The greatest pitfall, in the usual blocks vision task, is the
presumption that a single image is to be solved; thus diverting
attention away from the two most important depth perception
mechanisms which are motion parallax and stereo parallax. The second
pitfall is that the usual notion of a perspective line drawing is not
a natural intermediate state; but is rather a very sophisticated and
platonic geometric idea. The perfect line drawing lacks photometric
information; even a line drawing with perfect shadow lines included
will not resemble anything that can readily be gotten by processing
real television pictures. Curiously, the lack of success in deriving
line drawings from real television images of real blocks has not
dampened interest in solving the second part of the problem. The
perfect line drawing puzzle was first worked on by Guzman and
extended to perfect shadows by Waltz; nevertheless, enough remains so
that the puzzle will persist on its own merits, without being
closely relevant to real world computer vision. Even assuming that
imperfect line drawings are given, the blocks themselves have
led such researchers as Falk and Grape to concentrate on vertex/edge
classification schemes which have not been extended beyond the blocks
domain. The blocks task could be rehabilitated by concentrating on
photometric modeling and the use of multiple images for depth
perception.
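For reference, the stereo parallax mechanism mentioned above reduces to the standard relation Z = f·b/d: with two camera stations a baseline b apart and focal length f, a feature's depth Z is inversely proportional to its disparity d between the two images. The sketch below uses made-up numbers, and applies equally to motion parallax if b is taken as the distance the camera has traveled.

```python
# Depth from stereo (or motion) parallax: Z = f * b / d, where d is the
# horizontal disparity of the feature between the two image positions.
# Units are arbitrary but must be consistent; the numbers are made up.

def depth_from_disparity(focal_length, baseline, x_left, x_right):
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("feature must lie in front of both camera stations")
    return focal_length * baseline / disparity
```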
Sixth, the Stanford Artificial Intelligence Laboratory has
recently (1974) begun work on a National Science Foundation Grant
supporting research in automatic machine assembly. In particular,
effort will be directed to developing techniques that can be
demonstrated by automatically assembling a chain saw gasoline
engine. Two vision questions in such a machine assembly task are
where is the part and where is the hole; these questions will be
initially handled by composing ad hoc part and hole detectors for
each vision step required for the assembly.
The point of this task survey was to delimit what is and is
not a task requiring real 3-D vision; and to point out that caution
has to be taken to preserve the vision aspects of a given task. In
the usual course of vision projects, a single task or a single tool
unfortunately dominates the research; my work is no exception: the
one tool is 3-D modeling, and the task that dominated the formative
stages of the research is that of the robot-chauffeured cart. A
better understanding of the ultimate nature of computer vision can be
obtained by keeping the several tasks and the several tools in mind.
⊂6.3 Vision System Design Arguments.⊃
The physical information most directly relevant to vision is
the location, extent and light scattering properties of solid opaque
objects; the location, orientation and projection of the camera that
takes the pictures; and the location and nature of the light that
illuminates the world. The transformation rules of the everyday
world that a programmer may assume, a priori, are the laws of
physics. The arguments against geometric modeling divide
into two categories: the reasonable and the intuitive.
The reasonable arguments attack 3-D geometric modeling by
comparing it to another modeling alternative, (some alternatives are
listed in the box immediately below). Actually, the domains
of efficiency of the possible kinds of models do not
greatly overlap; and an artificial intellect will have some portion of
each kind. Nevertheless, I feel that 3-D geometric modeling is
superior for the task at hand, and that the other models are less
relevant to vision.{Q}
{|;λ10;JAFA}
BOX 6.6{JVJC} Alternatives to 3-D Geometric Modeling in a Vision System.
1. Image memory, with only the camera model in 3-D.
2. Statistical world model, e.g. Duda & Hart.
3. Procedural Knowledge, e.g. Hewett & Winograd.
4. Semantic knowledge, e.g. Wilkes & Shank.
5. Formal Logic models, e.g. McCarthy & Hayes.
6. Syntactic models.
{|;λ30;JUFA}
Perhaps the best alternative to a 3-D geometric model is to have a
library of little 2-D images describing the appearance of various 3-D
loci from given directions. The advantage would be that a
sophisticated image predictor would not be required; on the other
hand, the image library is potentially quite large, and even with
a huge data base new views and lighting of familiar objects and
scenes can not be anticipated.
The statistical model is quite relevant to vision and can be
added to the geometric model. However, the statistical model can not
stand alone because the processes of occultation, rotation and
illumination make the approach infeasible.
Procedural knowledge models represent the world in terms of
routines (or actors) which either know or can compute the answer to a
question about the world. Semantic models represent the world in terms
of a data structure of conceptual statements; and formal logic models
represent the world in terms of first order predicate calculus or in
terms of a situation calculus. The procedural, semantic and formal
logic world models are all general enough to represent a
vision model and in a theoretical sense they are merely other notations
for 3-D geometric modeling. However, in practice, these three
modeling regimes are not efficient holders and handlers of
quantitative geometric data, but are rather intended for a higher
level of abstract reasoning. Another alleged advantage of these
higher models is that they can represent partial knowledge and
uncertainty, which in a geometric model is implicit, in that
structures are missing or incomplete. For example, McCarthy and
Feldman demand that, when a robot has only seen the front of an office
desk, the model should be able to draw inferences about the back
of the desk; I feel that this so-called advantage is not required by
the problem and that basic visual modeling is on a more agnostic
level.
The syntactical approach to descriptive vision is that an
image is a sentence of a picture grammar and that consequently the
image description should be given in terms of the sequence of grammar
transformations rules. Again this paradigm is valid in principle but
impractical for real images of 3-D objects because simple
replacement rules can not readily express rotation, perspective,
and photometric transformations. On the other hand, the syntactical
models have been of some use in describing 2-D shapes. (Gipps, 74).
The intuitive arguments include the opinions that geometric
modeling is too numerical, too exact, or too non-human to be relevant
for computer vision research. Against such intuitions, I wish to pose
two fallacies. First, there is the natural mimicry fallacy, which is
that it is false to insist that a machine must mimic nature in order
to achieve its design goals. Boeing 747's are not covered with
feathers; trucks do not have legs; and computer vision need not
simulate human vision. The advocates of the uniqueness of natural
intelligence and perception will have to come up with a rather
unusual uniqueness proof to establish their conjecture. In the
meantime, one should be open minded about the potential forms a
perceptive consciousness can take.
Second, there is the self introspection fallacy, which is
that it is false to insist that one's introspections about how he
thinks and sees are direct observations of thought and sight. By
introspection some conclude that the visual models (even on a low
level) are essentially qualitative rather than quantitative. My belief
is that the vision processing of the brain is quite quantitative and
only passes into qualities at a higher level of processing. In either
case, the exact details of human visual processing are inaccessible
to conscious self inspection.
Although describing the above two fallacies might soften a
person's prejudice against numerical geometric modeling, some
important argument or idea is missing that would convince the so
prejudiced of the importance of numerical models prior to the full
achievement of computer vision (vice versa, I have not heard an
argument that would change my prejudice in favor of such models).
This matter of conflicting intuitions would not be important, were
it not that "they" include so many of my immediate colleagues. (Of
course, I may well be proved wrong if really powerful 3-D computer
vision systems are ever built without using any geometric models
worth speaking of, perhaps employing an elaborate stimulus response
paradigm).
⊂6.4 Mobile Robot Vision.⊃
The elements discussed so far will now be brought together
into a system design for performing mobile robot vision. The proposed
system is illustrated below in the block diagram in box (6.7). (The
diagram is called a mandala, in that
a <mandala> is any circle-like system diagram). Although the robot
chauffeured cart was the main task theme for this research, I have
failed to date, August 1974, to achieve the hardware and software
required to drive the cart around the laboratory under its own
control. Nevertheless, this necessarily theoretical cart system has
been of considerable use in developing the visual 3-D modeling
routines and theory, which are the subject of this thesis.
{|;JV;FA}
BOX 6.7{JC} CART VISION MANDALA.
{W300;λ4;F2}
→→→→→→→→→→→→→→→→→→→ PERCEIVED →→→→→→ REALITY →→→→→→ PREDICTED →→→→
↑ WORLD SIMULATOR WORLD ↓
↑ ↓
↑ ↓
↑ PERCEIVED →→→→→→ CART →→→→→→→→ PREDICTED →→→↓
↑ CAMERA LOCUS DRIVER CAMERA LOCUS ↓
↑ ↑ ↓ ↓
↑ ↑ ↓ ↓
↑ ↑ THE CART PREDICTED→→→→↓
BODY CAMERA SUN LOCUS ↓
LOCUS LOCUS ↓
SOLVER SOLVER ↓
↑ ↑ ↓
↑ ↑ ↓
REVEAL VERIFY IMAGE
COMPARE COMPARE SYNTHESIZER
↑ ↑ ↑ ↑ ↓
↑ ↑ ↑ ↑ ↓
↑ ←← PERCEIVED→→→→→↑ ↑←←←←←←←←←←←←←←←←←←←← PREDICTED ←←←←←←←↓
←←←←← MOSAIC IMAGE MOSAIC IMAGE ↓
↑ ↑ ↓
↑ ↑ ↓
↑ ↑ ↓
PERCEIVED PREDICTED ↓
CONTOUR IMAGE CONTOUR IMAGE ↓
↑ ↑ ↓
↑ ↑ ↓
↑ ↑ ↓
PERCEIVED PREDICTED ←←←←←←←←←
VIDEO IMAGE VIDEO IMAGE
↑
↑
↑
TELEVISION
CAMERA
{|;λ30;JUFA}
The robot chauffeur task involves establishing the
correspondence between an internal road map and the appearance of the
road in order to steer a vehicle along a predefined path. For a first
cut, the planned route is assumed to be clear, and the cart and the
sun are assumed to be the only movable things in a static world.
Dealing with moving obstacles is a second problem; motion through a
static world must be dealt with first.
The cart at the Stanford Artificial Intelligence Laboratory
is intended for outdoors use and consists of a piece of plywood, four
bicycle wheels, six electric motors, two car batteries, a television
camera, a television transmitter, a box of digital logic, a box of
relays, and a toy airplane radio receiver. (The vehicle being
discussed is not "Shakey", which belongs to the Stanford Research
Institute's Artificial Intelligence Group. There are two A.I. labs
near Stanford and each has a computer controlled vehicle). The six
possible cart actions are: run forwards, run backwards, steer to the
left, steer to the right, pan camera to the left, pan camera to the
right. Other than the television camera, there is no telemetry
concerning the state of the cart or its immediate environment.
The solution to the cart problem begins with the cart at a
known starting position with a road map of visual landmarks with
known loci. That is, the upper leftmost two rectangles of the cart
mandala are initialized so that the perceived cart locus and the
perceived world correspond with reality. Flowing across the top of
the mandala, the cart driver blindly moves the cart forward along
the desired route by dead reckoning (say the cart moves five feet and
stops) and the driver updates the predicted cart locus. The reality
simulator is an identity in this simple case because the world is
assumed static. Next the image synthesizer uses the predicted world,
camera and sun to compute a predicted image containing the landmark
features expected to be in view. Now, in the lower left of the
mandala, the cart's television camera takes a perceived picture and
(flowing upwards) the picture is converted into a form suitable for
comparing and matching with the predicted image. Features that are
both predicted and perceived and found to match are used by the
camera locus solver to compute a new perceived camera locus (from
which the cart locus can be deduced). Now the cart driver compares
the perceived and the predicted cart locus and corrects its course
and moves the cart again, and so on.
{|;λ10;JAFA}
BOX 6.8 {JC} A POSSIBLE CART TASK SOLUTION.
1. Predict (or retrieve) 2D image features.
2. Perceive (take) a television picture and convert.
3. Compare (verify) predicted and perceived features.
4. Solve for camera locus.
5. Servo the cart along its intended course.
{|;λ30;JUFA}
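The five steps of box 6.8 might be caricatured, for a cart measured in one dimension along its route, as the loop below. The drift term stands in for everything that feature prediction, perception and matching must recover, and every name is invented for the illustration.

```python
# Caricature of box 6.8: dead reckon forward, then let perception
# correct the predicted cart locus. The drift term stands in for the
# error that feature matching and the camera locus solver recover.

def chauffeur(start, route_length, step, drift_per_move):
    predicted = actual = float(start)
    while actual < route_length:
        predicted += step                  # 5. servo: blind dead-reckoned move
        actual += step + drift_per_move    # what the cart really did
        offset = actual - predicted        # 1-3. predict, perceive, compare
        predicted += offset                # 4. camera locus solver correction
    return predicted, actual
```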
The remaining limb of the cart mandala is invoked in order to
turn the chauffeur into an explorer. Perceived images are compared
through time by the reveal compare and new features are located by the
body locus solver and placed into the world model.
The generality and feasibility of such a cart system
depend almost entirely on the representation of the world and the
representation of image features. (The more general, the less
feasible). Although the bulk of the rest of this document develops
a polyhedral representation for the sake of photometric generality,
four simpler cart systems could be realized by using simpler models.
A first system consists of a road map, a road model, a road
model generator, a solar ephemeris, an image predictor, an image
comparator, a camera locus solver, and a course servo routine. The
roadways and nearby environs are entered into the computer. In fact,
real roadways are constructed from a two dimensional (X,Y) alignment
map, showing the way the center of the road goes as a curve composed of
line segments and circular arcs; and a second two dimensional (S,Z)
elevation diagram, showing the height of the surface above sea level,
as a function of distance along the road, as a sequence of linear
grades and vertical arcs which (not too surprisingly) are nearly cubic
splines. A second version is like the first, except that the road model,
road model generator, and image predictor are replaced by a library
of road images. In this system the robot vehicle is trained by
being driven down the roads it is supposed to follow. A third system
is like the first except that the road map is not initially given,
and indeed the road is no longer presumed to exist. Part of the
problem becomes finding a road, a road in the sense of a clear area;
this version yields the cart explorer, and if the clear area is found
quite rapidly and the world is updated quite frequently, the explorer
can be a chauffeur that can handle obstacles and moving objects. The
fourth system is like the third, except that the world is modeled by
a single valued surface elevation function, rather than by a
polyhedral model.
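The (X,Y) alignment map of the first system, a road centerline built from line segments and circular arcs and queried by distance along the road, can be sketched as follows; the element encoding is an assumption of the illustration.

```python
import math

# Road centerline as a sequence of straight segments and circular arcs,
# queried by arc length s. Element encoding (invented here):
#   ("line", length, heading)               straight piece
#   ("arc",  length, start_heading, radius) circular piece (signed radius)

def centerline(elements, s):
    x = y = 0.0
    for kind, length, heading, *rest in elements:
        d = min(s, length)
        if kind == "line":
            x += d * math.cos(heading)
            y += d * math.sin(heading)
        else:
            (radius,) = rest
            turned = heading + d / radius
            x += radius * (math.sin(turned) - math.sin(heading))
            y += radius * (math.cos(heading) - math.cos(turned))
        s -= d
        if s <= 0:
            break
    return x, y
```

The (S,Z) elevation profile admits the same treatment in one dimension, with linear grades in place of line segments and vertical arcs in place of horizontal ones.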
⊂6.5 Summary and Related Vision Work.⊃
To recapitulate, three vision system design requirements were
postulated: reality, generality, and continuity. These requirements
were illustrated by discussing a number of vision related tasks.
Next, a vision system was described as mediating between 2-D images
and a world model; with the world model being further broken down
into a 3-D geometric model and a task world model. Between these
entities three basic vision modes were identified: recognition,
verification and revelation (description). Finally, the general
purpose vision system was depicted as a quantitative and description
oriented feedback cycle which maintains a 3-D geometric model for the
sake of higher qualitative, symbolic, and recognition oriented task
processors.
Approaching the vision system in greater detail, the roles of
seven (or so) essential kinds of processors were explained: the task
processor, 3-D modeling routines, reality simulator, image
analyser, image synthesizer, comparators, and locus solvers. The
processors and data types were assembled into a cart chauffeur system.
Larry Roberts is justly credited for doing the seminal work
in 3-D Computer Vision; although his thesis appeared over ten years
ago, the subject has languished, dependent on and overshadowed by the
four areas called: Image Processing, Pattern Recognition, Computer
Graphics, and Artificial Intelligence. Outside the computer
sciences, workers in psychology, neurology and philosophy also seek a
theory of vision.
Image Processing involves the study and development of
programs that enhance, transform and compare 2D images. Nearly all
image processing work can eventually be applied to computer vision in
various circumstances. A good survey of this field can be found in an
article by Rosenfeld(69). Image Pattern Recognition involves two
steps: feature extraction and classification. A comprehensive text
about this field with respect to computer vision, has been written by
Duda and Hart(73). Computer Graphics is the inverse of descriptive
computer vision. The problem of computer graphics is to synthesize
images from three dimensional models; the problem of descriptive
computer vision is to analyze images into three dimensional models.
An introductory text book about this field would be that of Newman
and Sproull(73). Finally, there is Artificial Intelligence, which in
my opinion is an institution sheltering a heterogeneous group of
embryonic computer subjects; the biggest of the present day orphans
include: robotics, natural language, theorem proving, speech
analysis, vision and planning. A more narrow and relevant definition
of artificial intelligence is that it concerns the programming of the
robot task processor which sits above the vision system. There is no
general reference on Artificial Intelligence that I wish to
recommend.
The related vision work of specific individuals has already
been mentioned in context. To summarize, the present vision work is
related to the early work of Roberts(63) and Sutherland(63); to the
recent work at Stanford: Falk, Feldman and Paul(67), Tenenbaum(72),
Agin(72), Grape(73); to the work at MIT: Guzman, Horn, Waltz,
Krakauer; to the work at the University of Utah: Warnock, Watkins;
and to work at other places: SRI and JPL. Future progress in computer
vision will proceed in step with better computer hardware, better
computer graphics software, and better world modeling software.
Future vision work at Stanford, which is related to the present
theory will be done by Lynn Quam and Hans Moravec. At JPL and SRI,
similar work on vehicle vision is being done. The machine
assembly task is being pursued both by the Artificial Intelligence
Group of the Stanford Research Institute and by the Hand Eye Project
at Stanford University. Because the demand for doing practical vision
tasks can be satisfied with existing ad hoc methods or by not using a
visual sensor at all, I expect little or no vision progress per se
from such research, although their demonstrations should be robotic
spectaculars. Since the missing ingredient for computer vision is
the spatial modeling to which perceived images can be related, I
believe that the development of the technology for generating
commercial film and television by computer for entertainment will
make a significant contribution to computer vision.{Q}